Hadoop’s Overload Tolerant Design Exacerbates Failure Detection and Recovery

Authors

  • Florin Dinu
  • T. S. Eugene Ng
Abstract

Data processing frameworks like Hadoop need to efficiently address failures, which are common occurrences in today’s large-scale data center environments. Failures have a detrimental effect on the interactions between the framework’s processes. Unfortunately, certain adverse but temporary conditions such as network or machine overload can have a similar effect. Treating this effect oblivious to the real underlying cause can lead to sluggish response to failures. We show that this is the case with Hadoop, which couples failure detection and recovery with overload handling into a conservative design with conservative parameter choices. As a result, Hadoop is oftentimes slow in reacting to failures and also exhibits large variations in response time under failure. These findings point to opportunities for future research on cross-layer data processing framework design.

Similar resources

A New Design of Fault Tolerant Comparator

In this paper we have presented a new design of fault tolerant comparator with a fault free hot spare. The aim of this design is to achieve a low overhead of time and area in fault tolerant comparators. We have used hot standby technique to normal operation of the system without interrupting and dynamic recovery method in fault detection and correction. The circuit is divided to smaller modules...

Analysis of Hadoop’s Performance under Failures

Failures are common in today’s data center environment and can significantly impact the performance of important jobs running on top of large scale computing frameworks. In this paper we analyze Hadoop’s behavior under compute node and process failures. Surprisingly, we find that even a single failure can have a large detrimental effect on job running times. We uncover several important design ...

Fault-tolerant design of the IBM pSeries 690 system using POWER4 processor technology

The POWER4-based p690 systems offer the highest performance of the IBM eServer pSeries line of computers. Within the general-purpose UNIX server market, they also offer the highest levels of concurrent error detection, fault isolation, recovery, and availability. High availability is achieved by minimizing component failure rates through improvements in the base technology, and through design t...

Somersault Software Fault-Tolerance

Keywords: software fault-tolerance, process replication failure masking, continuous availability, topology

The ambition of fault-tolerant systems is to provide application transparent fault-tolerance at the same performance as a non-fault-tolerant system. Somersault is a library for developing distributed fault-tolerant software systems that comes close to achieving both goals. We describe Somersault and...

Fault-Tolerant Wireless Multihop Transmissions with Byzantine Failure Detection

Wireless multihop networks consist of numbers of wireless nodes. Hence, introduction of failure detection and recovery is mandatory. Until now, various failure detection and recovery methods such as route switch and multiple routes detection have been proposed based on an assumption with stop failure model. However, the assumption that failed wireless nodes never transmit any messages is too re...

Publication date: 2011